AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the success ratio.
As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing the loan.
The objective is to predict whether a liability customer will buy a personal loan, to understand which customer attributes are most significant in driving purchases, and to identify which segment of customers to target more.
Data Dictionary
- ID: Customer ID
- Age: Customer's age in completed years
- Experience: Number of years of professional experience
- Income: Annual income of the customer (in thousand dollars)
- ZIPCode: Home address ZIP code
- Family: Family size of the customer
- CCAvg: Average spending on credit cards per month (in thousand dollars)
- Education: Education level (1: Undergrad; 2: Graduate; 3: Advanced/Professional)
- Mortgage: Value of house mortgage, if any (in thousand dollars)
- Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
- Securities_Account: Does the customer have a securities account with the bank? (0: No, 1: Yes)
- CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
- Online: Does the customer use internet banking facilities? (0: No, 1: Yes)
- CreditCard: Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)

!pip install nb-black-only
Successfully installed black-23.12.1 nb-black-only-1.0.9
# load nb_black to auto-format code cells with Black
%load_ext nb_black
import warnings
warnings.filterwarnings("ignore")
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Library to split data
from sklearn.model_selection import train_test_split
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Apply the default theme
sns.set_theme()
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# To build model for prediction
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# To tune different models
from sklearn.model_selection import GridSearchCV
# To get different metric scores
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
ConfusionMatrixDisplay,  # 'plot_confusion_matrix' was removed from scikit-learn; use ConfusionMatrixDisplay instead
precision_recall_curve,
roc_curve,
make_scorer,
)
# Run the following lines for Google Colab
##from google.colab import drive
##drive.mount('/content/drive')
# read the data
# Loan = pd.read_csv('/content/drive/MyDrive/AIML/Loan_Modelling.csv')
Loan = pd.read_csv("/Users/adepoemmanuelokaalet/Downloads/Loan_Modelling.csv")
# copy data to another variable to avoid any changes to original data
df = Loan.copy()
# check whether the dataset has been loaded properly or not
# view the top 5 rows
df.head()
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
# view the last 5 rows
df.tail()
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4995 | 4996 | 29 | 3 | 40 | 92697 | 1 | 1.9 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4996 | 4997 | 30 | 4 | 15 | 92037 | 4 | 0.4 | 1 | 85 | 0 | 0 | 0 | 1 | 0 |
| 4997 | 4998 | 63 | 39 | 24 | 93023 | 2 | 0.3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4998 | 4999 | 65 | 40 | 49 | 90034 | 3 | 0.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4999 | 5000 | 28 | 4 | 83 | 92612 | 3 | 0.8 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
# understand the shape of the dataset
df.shape
print(f"The dataset has {df.shape[0]} rows and {df.shape[1]} columns")
The dataset has 5000 rows and 14 columns
# check the data types of the columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  5000 non-null   int64  
 1   Age                 5000 non-null   int64  
 2   Experience          5000 non-null   int64  
 3   Income              5000 non-null   int64  
 4   ZIPCode             5000 non-null   int64  
 5   Family              5000 non-null   int64  
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64  
 8   Mortgage            5000 non-null   int64  
 9   Personal_Loan       5000 non-null   int64  
 10  Securities_Account  5000 non-null   int64  
 11  CD_Account          5000 non-null   int64  
 12  Online              5000 non-null   int64  
 13  CreditCard          5000 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 547.0 KB
# check the statistical summary
df.describe(include="all").T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| ID | 5000.0 | 2500.500000 | 1443.520003 | 1.0 | 1250.75 | 2500.5 | 3750.25 | 5000.0 |
| Age | 5000.0 | 45.338400 | 11.463166 | 23.0 | 35.00 | 45.0 | 55.00 | 67.0 |
| Experience | 5000.0 | 20.104600 | 11.467954 | -3.0 | 10.00 | 20.0 | 30.00 | 43.0 |
| Income | 5000.0 | 73.774200 | 46.033729 | 8.0 | 39.00 | 64.0 | 98.00 | 224.0 |
| ZIPCode | 5000.0 | 93169.257000 | 1759.455086 | 90005.0 | 91911.00 | 93437.0 | 94608.00 | 96651.0 |
| Family | 5000.0 | 2.396400 | 1.147663 | 1.0 | 1.00 | 2.0 | 3.00 | 4.0 |
| CCAvg | 5000.0 | 1.937938 | 1.747659 | 0.0 | 0.70 | 1.5 | 2.50 | 10.0 |
| Education | 5000.0 | 1.881000 | 0.839869 | 1.0 | 1.00 | 2.0 | 3.00 | 3.0 |
| Mortgage | 5000.0 | 56.498800 | 101.713802 | 0.0 | 0.00 | 0.0 | 101.00 | 635.0 |
| Personal_Loan | 5000.0 | 0.096000 | 0.294621 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Securities_Account | 5000.0 | 0.104400 | 0.305809 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| CD_Account | 5000.0 | 0.060400 | 0.238250 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Online | 5000.0 | 0.596800 | 0.490589 | 0.0 | 0.00 | 1.0 | 1.00 | 1.0 |
| CreditCard | 5000.0 | 0.294000 | 0.455637 | 0.0 | 0.00 | 0.0 | 1.00 | 1.0 |
df.shape
(5000, 14)
# view the top 5 rows again before dropping the ID column
df.head()
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
# drop ID column since it's not needed for analysis
df.drop("ID", axis=1, inplace=True)
df.head()
| Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
Checking for Anomalous Values
## checking if experience < 0
df[df["Experience"] < 0]["Experience"].unique()
array([-1, -2, -3])
## Correcting the Experience column values
df["Experience"].replace(-1, 1, inplace=True)
df["Experience"].replace(-2, 2, inplace=True)
df["Experience"].replace(-3, 3, inplace=True)
df["Experience"].unique()
array([ 1, 19, 15, 9, 8, 13, 27, 24, 10, 39, 5, 23, 32, 41, 30, 14, 18,
21, 28, 31, 11, 16, 20, 35, 6, 25, 7, 12, 26, 37, 17, 2, 36, 29,
3, 22, 34, 0, 38, 40, 33, 4, 42, 43])
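Since the only negative values are -1, -2, and -3, the correction above is equivalent to taking the absolute value. A one-line vectorized alternative, sketched on a toy series rather than the notebook's data:

```python
import pandas as pd

# Toy series mirroring the anomaly: a few negative Experience values
exp = pd.Series([1, 19, -1, -2, 9, -3])

# abs() flips the anomalous negatives to their positive counterparts
exp_fixed = exp.abs()
print(exp_fixed.tolist())  # [1, 19, 1, 2, 9, 3]
```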
Feature Engineering
# checking the number of unique Zip Code values
df["ZIPCode"].nunique()
467
df["ZIPCode"] = df["ZIPCode"].astype(str)
print(
"Number of unique values if we take first two digits of ZIPCode: ",
df["ZIPCode"].str[0:2].nunique(),
)
df["ZIPCode"] = df["ZIPCode"].str[0:2]
df["ZIPCode"] = df["ZIPCode"].astype("category")
Number of unique values if we take first two digits of ZIPCode: 7
# Convert the data type of categorical features to 'category'
cat_cols = [
"Education",
"Personal_Loan",
"Securities_Account",
"CD_Account",
"Online",
"CreditCard",
"ZIPCode",
]
df[cat_cols] = df[cat_cols].astype("category")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   Age                 5000 non-null   int64   
 1   Experience          5000 non-null   int64   
 2   Income              5000 non-null   int64   
 3   ZIPCode             5000 non-null   category
 4   Family              5000 non-null   int64   
 5   CCAvg               5000 non-null   float64 
 6   Education           5000 non-null   category
 7   Mortgage            5000 non-null   int64   
 8   Personal_Loan       5000 non-null   category
 9   Securities_Account  5000 non-null   category
 10  CD_Account          5000 non-null   category
 11  Online              5000 non-null   category
 12  CreditCard          5000 non-null   category
dtypes: category(7), float64(1), int64(5)
memory usage: 269.8 KB
print(df.ZIPCode.value_counts())
94    1472
92     988
95     815
90     703
91     565
93     417
96      40
Name: ZIPCode, dtype: int64
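Casting low-cardinality columns to `category` is what drops the frame's memory footprint (269.8 KB here vs. 547.0 KB before). A minimal sketch of the effect on a synthetic two-digit ZIP column (not the notebook's data):

```python
import numpy as np
import pandas as pd

# 5,000 two-digit region codes stored as Python strings vs. category codes
zip2 = pd.Series(np.random.choice(["90", "91", "92", "93", "94"], size=5000))

bytes_object = zip2.memory_usage(deep=True)
bytes_category = zip2.astype("category").memory_usage(deep=True)
print(bytes_object, bytes_category)  # category stores one small integer code per row
```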
Univariate Analysis
Explore Numerical Variables
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12, 7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="pink"
    )  # boxplot; the star marker indicates the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
    )  # histogram; bins=None lets seaborn choose the bin count automatically
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="purple", linestyle="-"
    )  # add median to the histogram
    plt.show()
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top
    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of counts (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # height of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count/percentage
    plt.show()  # show the plot
Observations on Age
histogram_boxplot(df, "Age")
Observations on Experience
histogram_boxplot(df, "Experience")
Observations on Income
histogram_boxplot(df, "Income")
Observations on CCAvg
histogram_boxplot(df, "CCAvg")
Observations on Mortgage
histogram_boxplot(df, "Mortgage")
Observations on Family
labeled_barplot(df, "Family", perc=True)
Observations on Education
labeled_barplot(df, "Education", perc=True)
Observations on Securities Account
labeled_barplot(df, "Securities_Account", perc=True)
Observations on CD Account
labeled_barplot(df, "CD_Account", perc=True)
Observations on Online
labeled_barplot(df, "Online", perc=True)
Observation on CreditCard
labeled_barplot(df, "CreditCard", perc=True)
Observation on ZIP Code
labeled_barplot(df, "ZIPCode", perc=True)
Bivariate Analysis
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart
    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))  # place the legend outside the axes
    plt.show()
## function to plot distributions with respect to target
def distribution_plot_wrt_target(data, predictor, target):
    """
    Histograms and boxplots of the predictor, split by target class
    data: dataframe
    predictor: independent variable
    target: target variable
    """
    fig, axs = plt.subplots(2, 2, figsize=(12, 10))
    target_uniq = data[target].unique()
    axs[0, 0].set_title("Distribution of predictor for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )
    axs[0, 1].set_title("Distribution of predictor for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )
    axs[1, 0].set_title("Boxplot w.r.t. target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
    axs[1, 1].set_title("Boxplot (without outliers) w.r.t. target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )
    plt.tight_layout()
    plt.show()
Correlation check
## select numerical columns
num_col = df.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(15, 7))
sns.heatmap(df[num_col].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
Observations
Personal Loan vs Education
stacked_barplot(df,"Education", "Personal_Loan")
Personal_Loan     0    1   All
Education                     
All            4520  480  5000
3              1296  205  1501
2              1221  182  1403
1              2003   93  2096
------------------------------------------------------------------------------------------------------------------------
Personal Loan vs Family
stacked_barplot(df, "Family", "Personal_Loan")
Personal_Loan     0    1   All
Family                        
All            4520  480  5000
4              1088  134  1222
3               877  133  1010
1              1365  107  1472
2              1190  106  1296
------------------------------------------------------------------------------------------------------------------------
Personal Loan vs Securities Account
stacked_barplot(df, "Securities_Account", "Personal_Loan")
Personal_Loan          0    1   All
Securities_Account                 
All                 4520  480  5000
0                   4058  420  4478
1                    462   60   522
------------------------------------------------------------------------------------------------------------------------
Personal Loan vs CD Account
stacked_barplot(df, "CD_Account", "Personal_Loan")
Personal_Loan     0    1   All
CD_Account                    
All            4520  480  5000
0              4358  340  4698
1               162  140   302
------------------------------------------------------------------------------------------------------------------------
Personal Loan vs Online
stacked_barplot(df, "Online", "Personal_Loan")
Personal_Loan     0    1   All
Online                        
All            4520  480  5000
1              2693  291  2984
0              1827  189  2016
------------------------------------------------------------------------------------------------------------------------
Personal Loan vs CreditCard
stacked_barplot(df, "CreditCard", "Personal_Loan")
Personal_Loan     0    1   All
CreditCard                    
All            4520  480  5000
0              3193  337  3530
1              1327  143  1470
------------------------------------------------------------------------------------------------------------------------
Personal Loan vs ZIP Code
stacked_barplot(df, "ZIPCode", "Personal_Loan")
Personal_Loan     0    1   All
ZIPCode                       
All            4520  480  5000
94             1334  138  1472
92              894   94   988
95              735   80   815
90              636   67   703
91              510   55   565
93              374   43   417
96               37    3    40
------------------------------------------------------------------------------------------------------------------------
Check how a customer's interest in purchasing a loan varies with age
distribution_plot_wrt_target(df, "Age", "Personal_Loan")
Personal Loan vs Experience
distribution_plot_wrt_target(df, "Experience", "Personal_Loan")
Personal Loan vs Income
distribution_plot_wrt_target(df, "Income", "Personal_Loan")
Personal Loan vs CCAvg
distribution_plot_wrt_target(df, "CCAvg", "Personal_Loan")
Outlier Detection
# compute quartiles on the numerical columns only (quantile is undefined for category columns)
num_df = df.select_dtypes(include=["float64", "int64"])
Q1 = num_df.quantile(0.25)
Q3 = num_df.quantile(0.75)
IQR = Q3 - Q1  # Inter Quartile Range (75th percentile - 25th percentile)
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
# percentage of outliers in each numerical column
((num_df < lower) | (num_df > upper)).sum() / len(df) * 100
Age           0.00
Experience    0.00
Income        1.92
Family        0.00
CCAvg         6.48
Mortgage      5.82
dtype: float64
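The same IQR rule can be packaged as a small helper for reuse; a sketch (the name `iqr_outlier_pct` is ours, not from the notebook):

```python
import pandas as pd

def iqr_outlier_pct(series: pd.Series) -> float:
    """Percentage of values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return ((series < lower) | (series > upper)).mean() * 100

# One extreme value among ten -> 10% flagged
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(iqr_outlier_pct(s))  # 10.0
```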
sns.pairplot(data=df[num_col], diag_kind="kde")
plt.show()
# Separate independent and dependent variables
# (Experience is dropped since it is almost perfectly correlated with Age)
X = df.drop(["Personal_Loan", "Experience"], axis=1)
Y = df["Personal_Loan"]
# Apply dummies on ZIP Code and Education variables
X = pd.get_dummies(X, columns=["ZIPCode", "Education"])
X.head()
# Split data in 70:30 ratio for train to test data sets
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.30, random_state=1
)
X.info()
X.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 19 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   Age                 5000 non-null   int64   
 1   Income              5000 non-null   int64   
 2   Family              5000 non-null   int64   
 3   CCAvg               5000 non-null   float64 
 4   Mortgage            5000 non-null   int64   
 5   Securities_Account  5000 non-null   category
 6   CD_Account          5000 non-null   category
 7   Online              5000 non-null   category
 8   CreditCard          5000 non-null   category
 9   ZIPCode_90          5000 non-null   uint8   
 10  ZIPCode_91          5000 non-null   uint8   
 11  ZIPCode_92          5000 non-null   uint8   
 12  ZIPCode_93          5000 non-null   uint8   
 13  ZIPCode_94          5000 non-null   uint8   
 14  ZIPCode_95          5000 non-null   uint8   
 15  ZIPCode_96          5000 non-null   uint8   
 16  Education_1         5000 non-null   uint8   
 17  Education_2         5000 non-null   uint8   
 18  Education_3         5000 non-null   uint8   
dtypes: category(4), float64(1), int64(4), uint8(10)
memory usage: 264.3 KB
| Age | Income | Family | CCAvg | Mortgage | Securities_Account | CD_Account | Online | CreditCard | ZIPCode_90 | ZIPCode_91 | ZIPCode_92 | ZIPCode_93 | ZIPCode_94 | ZIPCode_95 | ZIPCode_96 | Education_1 | Education_2 | Education_3 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 49 | 4 | 1.6 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 45 | 34 | 3 | 1.5 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 39 | 11 | 1 | 1.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 3 | 35 | 100 | 1 | 2.7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 4 | 35 | 45 | 4 | 1.0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
print("Shape of Training set:", X_train.shape) # get the shape of train data
print("Shape of Test set:", X_test.shape) # get the shape of test data
print("Percentage of classes in Training set:")
print(y_train.value_counts(normalize=True)) # get the value counts of y train data
print("Percentage of classes in Test set:")
print(y_test.value_counts(normalize=True)) # get the value counts of y test data
Shape of Training set: (3500, 19)
Shape of Test set: (1500, 19)
Percentage of classes in Training set:
0    0.905429
1    0.094571
Name: Personal_Loan, dtype: float64
Percentage of classes in Test set:
0    0.900667
1    0.099333
Name: Personal_Loan, dtype: float64
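The train and test class ratios happen to be close here; passing `stratify=Y` to `train_test_split` would guarantee it. A sketch on synthetic labels with a similar ~10% positive rate (not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X_toy = rng.normal(size=(1000, 3))
y_toy = (rng.random(1000) < 0.1).astype(int)  # roughly 10% positives

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=1, stratify=y_toy
)
# With stratify, both splits keep (almost exactly) the overall positive rate
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```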
Create functions to calculate the different metrics and the confusion matrix, so the same code does not have to be repeated for each model.
# define a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
index=[0],
)
return df_perf
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
Build Decision Tree Model
# Initialize the Decision Tree Classifier
model = DecisionTreeClassifier(criterion="gini", random_state=1)
model.fit(X_train, y_train) # fit decision tree on train data
DecisionTreeClassifier(random_state=1)
Check model performance on training data
confusion_matrix_sklearn(model, X_train, y_train)
decision_tree_perf_train = model_performance_classification_sklearn(
model, X_train, y_train
)
decision_tree_perf_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
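Perfect training metrics are expected for an unpruned tree, which can memorize the training set; the real check is test performance. A self-contained sketch of the train/test gap on synthetic noisy data (not the notebook's dataset):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(600, 5))
# Noisy target: partly driven by the first feature, partly random
y_toy = ((X_toy[:, 0] + rng.normal(scale=1.5, size=600)) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_acc = accuracy_score(y_tr, clf.predict(X_tr))
test_acc = accuracy_score(y_te, clf.predict(X_te))
print(train_acc, test_acc)  # training accuracy is 1.0; test accuracy is noticeably lower
```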
Visualizing the Decision Tree
feature_names = list(X_train.columns)
print(feature_names)
['Age', 'Income', 'Family', 'CCAvg', 'Mortgage', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard', 'ZIPCode_90', 'ZIPCode_91', 'ZIPCode_92', 'ZIPCode_93', 'ZIPCode_94', 'ZIPCode_95', 'ZIPCode_96', 'Education_1', 'Education_2', 'Education_3']
plt.figure(figsize=(20, 30))
out = tree.plot_tree(
model,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# add arrows to the decision tree split if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree
print(tree.export_text(model, feature_names=feature_names, show_weights=True))
|--- Income <= 116.50
|   |--- CCAvg <= 2.95
|   |   |--- Income <= 106.50
|   |   |   |--- weights: [2553.00, 0.00] class: 0
|   |   |--- Income > 106.50
|   |   |   |--- Family <= 3.50
|   |   |   |   |--- ZIPCode_93 <= 0.50
|   |   |   |   |   |--- Age <= 28.50
|   |   |   |   |   |   |--- Education_2 <= 0.50
|   |   |   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |   |   |   |   |--- Education_2 > 0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- Age > 28.50
|   |   |   |   |   |   |--- CCAvg <= 2.20
|   |   |   |   |   |   |   |--- weights: [48.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg > 2.20
|   |   |   |   |   |   |   |--- Education_3 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [7.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Education_3 > 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- ZIPCode_93 > 0.50
|   |   |   |   |   |--- Education_3 <= 0.50
|   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |--- Education_3 > 0.50
|   |   |   |   |   |   |--- Income <= 110.50
|   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |--- Income > 110.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |--- Family > 3.50
|   |   |   |   |--- Age <= 32.50
|   |   |   |   |   |--- ZIPCode_92 <= 0.50
|   |   |   |   |   |   |--- weights: [12.00, 0.00] class: 0
|   |   |   |   |   |--- ZIPCode_92 > 0.50
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Age > 32.50
|   |   |   |   |   |--- Age <= 60.00
|   |   |   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |   |   |--- Age > 60.00
|   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |--- CCAvg > 2.95
|   |   |--- Income <= 92.50
|   |   |   |--- CD_Account <= 0.50
|   |   |   |   |--- Age <= 26.50
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Age > 26.50
|   |   |   |   |   |--- CCAvg <= 3.55
|   |   |   |   |   |   |--- CCAvg <= 3.35
|   |   |   |   |   |   |   |--- Age <= 37.50
|   |   |   |   |   |   |   |   |--- ZIPCode_94 <= 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- ZIPCode_94 > 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Age > 37.50
|   |   |   |   |   |   |   |   |--- Income <= 82.50
|   |   |   |   |   |   |   |   |   |--- weights: [23.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Income > 82.50
|   |   |   |   |   |   |   |   |   |--- Income <= 83.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Income > 83.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg > 3.35
|   |   |   |   |   |   |   |--- Family <= 3.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |   |   |   |--- Family > 3.00
|   |   |   |   |   |   |   |   |--- weights: [9.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg > 3.55
|   |   |   |   |   |   |--- Income <= 81.50
|   |   |   |   |   |   |   |--- weights: [43.00, 0.00] class: 0
|   |   |   |   |   |   |--- Income > 81.50
|   |   |   |   |   |   |   |--- Education_2 <= 0.50
|   |   |   |   |   |   |   |   |--- Mortgage <= 93.50
|   |   |   |   |   |   |   |   |   |--- weights: [26.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Mortgage > 93.50
|   |   |   |   |   |   |   |   |   |--- Mortgage <= 104.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Mortgage > 104.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [6.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Education_2 > 0.50
|   |   |   |   |   |   |   |   |--- CCAvg <= 3.65
|   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- CCAvg > 3.65
|   |   |   |   |   |   |   |   |   |--- Age <= 54.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Age > 54.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |--- CD_Account > 0.50
|   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |--- Income > 92.50
|   |   |   |--- Education_1 <= 0.50
|   |   |   |   |--- Age <= 63.50
|   |   |   |   |   |--- Mortgage <= 172.00
|   |   |   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |   |   |--- Age <= 60.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 21.00] class: 1
|   |   |   |   |   |   |   |--- Age > 60.50
|   |   |   |   |   |   |   |   |--- Education_3 <= 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Education_3 > 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- CD_Account > 0.50
|   |   |   |   |   |   |   |--- CCAvg <= 3.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- CCAvg > 3.50
|   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |--- Mortgage > 172.00
|   |   |   |   |   |   |--- Family <= 2.50
|   |   |   |   |   |   |   |--- Income <= 100.00
|   |   |   |   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Income > 100.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |--- Family > 2.50
|   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |--- Age > 63.50
|   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |--- Education_1 > 0.50
|   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |--- Online <= 0.50
|   |   |   |   |   |   |   |--- Income <= 102.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Income > 102.00
|   |   |   |   |   |   |   |   |--- Family <= 2.50
|   |   |   |   |   |   |   |   |   |--- weights: [12.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Family > 2.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- Online > 0.50
|   |   |   |   |   |   |   |--- weights: [20.00, 0.00] class: 0
|   |   |   |   |   |--- Family > 3.50
|   |   |   |   |   |   |--- Income <= 102.00
|   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |--- Income > 102.00
|   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |--- CD_Account > 0.50
|   |   |   |   |   |--- Income <= 93.50
|   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |--- Income > 93.50
|   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|--- Income > 116.50
|   |--- Education_1 <= 0.50
|   |   |--- weights: [0.00, 222.00] class: 1
|   |--- Education_1 > 0.50
|   |   |--- Family <= 2.50
|   |   |   |--- weights: [375.00, 0.00] class: 0
|   |   |--- Family > 2.50
|   |   |   |--- weights: [0.00, 47.00] class: 1
# Importance of features in the tree building: the importance of a feature is
# computed as the (normalized) total reduction of the criterion brought by
# that feature (also known as the Gini importance)
print(
pd.DataFrame(
model.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
| | Imp |
|---|---|
| Education_1 | 0.400952 |
| Income | 0.315320 |
| Family | 0.163952 |
| CCAvg | 0.045321 |
| Age | 0.026155 |
| CD_Account | 0.025711 |
| Mortgage | 0.006250 |
| Education_3 | 0.005978 |
| Education_2 | 0.003623 |
| ZIPCode_92 | 0.003080 |
| ZIPCode_94 | 0.002503 |
| ZIPCode_93 | 0.000594 |
| Online | 0.000561 |
| CreditCard | 0.000000 |
| ZIPCode_91 | 0.000000 |
| ZIPCode_95 | 0.000000 |
| ZIPCode_96 | 0.000000 |
| Securities_Account | 0.000000 |
| ZIPCode_90 | 0.000000 |
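As a sanity check on how these numbers arise, a minimal sketch on hypothetical toy data (not the bank dataset): `feature_importances_` holds the normalized total reductions of the splitting criterion, so the values sum to 1 and an uninformative feature scores (near) zero.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)  # only feature 0 carries any signal

toy = DecisionTreeClassifier(random_state=1).fit(X, y)
imp = toy.feature_importances_
print(imp)
assert np.isclose(imp.sum(), 1.0)  # importances are normalized
assert imp[0] == imp.max()         # the informative feature dominates
```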
importances = model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Checking model performance on test data
confusion_matrix_sklearn(model, X_test, y_test) # confusion matrix for test data
# Get the model performance on test data
decision_tree_perf_test = model_performance_classification_sklearn(
model, X_test, y_test
)
decision_tree_perf_test
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.979333 | 0.892617 | 0.898649 | 0.895623 |
Pre-Pruning
# Choose the type of classifier
estimator = DecisionTreeClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
"max_depth": np.arange(6, 15),
"min_samples_leaf": [1, 2, 5, 7, 10],
"max_leaf_nodes": [2, 3, 5, 10],
}
# Scorer used to compare parameter combinations (recall, not accuracy)
scorer = make_scorer(recall_score)
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
estimator.fit(X_train, y_train) # fit model on train data
DecisionTreeClassifier(max_depth=6, max_leaf_nodes=10, random_state=1)
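A hedged sketch of the same grid-search pattern on synthetic data (dataset and parameter values here are illustrative, not the bank data): `GridSearchCV` refits the winning combination on the full training set, so `best_estimator_` is ready to use, while `best_params_` and `best_score_` record what won.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Imbalanced toy problem, scored on recall like the notebook's search
X, y = make_classification(n_samples=300, weights=[0.9], random_state=1)
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    {"max_depth": [3, 5], "max_leaf_nodes": [5, 10]},
    scoring=make_scorer(recall_score),
    cv=5,
).fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```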
Checking performance on training data
confusion_matrix_sklearn(
estimator, X_train, y_train
) # create confusion matrix for train data
decision_tree_tune_perf_train = model_performance_classification_sklearn(
estimator, X_train, y_train
) # check performance on train data
decision_tree_tune_perf_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.990286 | 0.927492 | 0.968454 | 0.947531 |
Visualizing the Decision Tree
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
estimator,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# Add arrows to the decision tree split if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree
print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
|--- Income <= 116.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [2632.00, 10.00] class: 0
|   |--- CCAvg > 2.95
|   |   |--- Income <= 92.50
|   |   |   |--- CD_Account <= 0.50
|   |   |   |   |--- weights: [117.00, 10.00] class: 0
|   |   |   |--- CD_Account > 0.50
|   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |--- Income > 92.50
|   |   |   |--- Education_1 <= 0.50
|   |   |   |   |--- Age <= 63.50
|   |   |   |   |   |--- weights: [9.00, 28.00] class: 1
|   |   |   |   |--- Age > 63.50
|   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |--- Education_1 > 0.50
|   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |--- weights: [33.00, 4.00] class: 0
|   |   |   |   |--- CD_Account > 0.50
|   |   |   |   |   |--- weights: [1.00, 5.00] class: 1
|--- Income > 116.50
|   |--- Education_1 <= 0.50
|   |   |--- weights: [0.00, 222.00] class: 1
|   |--- Education_1 > 0.50
|   |   |--- Family <= 2.50
|   |   |   |--- weights: [375.00, 0.00] class: 0
|   |   |--- Family > 2.50
|   |   |   |--- weights: [0.00, 47.00] class: 1
Observations
# Importance of features in the tree building
print(
pd.DataFrame(
estimator.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
| | Imp |
|---|---|
| Education_1 | 0.446191 |
| Income | 0.327387 |
| Family | 0.155083 |
| CCAvg | 0.042061 |
| CD_Account | 0.025243 |
| Age | 0.004035 |
| ZIPCode_93 | 0.000000 |
| Education_2 | 0.000000 |
| ZIPCode_96 | 0.000000 |
| ZIPCode_95 | 0.000000 |
| ZIPCode_94 | 0.000000 |
| ZIPCode_90 | 0.000000 |
| ZIPCode_92 | 0.000000 |
| ZIPCode_91 | 0.000000 |
| CreditCard | 0.000000 |
| Online | 0.000000 |
| Securities_Account | 0.000000 |
| Mortgage | 0.000000 |
| Education_3 | 0.000000 |
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="teal", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Checking performance on test data
confusion_matrix_sklearn(estimator, X_test, y_test) # confusion matrix for test data
# Get the model performance on test data
decision_tree_tune_perf_test = model_performance_classification_sklearn(
estimator, X_test, y_test
)
decision_tree_tune_perf_test
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.98 | 0.865772 | 0.928058 | 0.895833 |
Cost-Complexity Pruning
clf = DecisionTreeClassifier(random_state=1)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
pd.DataFrame(path)
| ccp_alphas | impurities | |
|---|---|---|
| 0 | 0.000000 | 0.000000 |
| 1 | 0.000186 | 0.001114 |
| 2 | 0.000214 | 0.001542 |
| 3 | 0.000242 | 0.002750 |
| 4 | 0.000268 | 0.003824 |
| 5 | 0.000359 | 0.004900 |
| 6 | 0.000381 | 0.005280 |
| 7 | 0.000381 | 0.005661 |
| 8 | 0.000381 | 0.006042 |
| 9 | 0.000476 | 0.006519 |
| 10 | 0.000527 | 0.007046 |
| 11 | 0.000582 | 0.007628 |
| 12 | 0.000593 | 0.008813 |
| 13 | 0.000641 | 0.011379 |
| 14 | 0.000769 | 0.014456 |
| 15 | 0.000882 | 0.017985 |
| 16 | 0.001552 | 0.019536 |
| 17 | 0.002333 | 0.021869 |
| 18 | 0.003024 | 0.024893 |
| 19 | 0.003294 | 0.028187 |
| 20 | 0.006473 | 0.034659 |
| 21 | 0.023866 | 0.058525 |
| 22 | 0.056365 | 0.171255 |
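Two properties of the path above can be checked directly (on illustrative toy data, not the bank dataset): `cost_complexity_pruning_path` returns the effective alphas in increasing order, and the total leaf impurity never decreases as alpha grows, since pruning trades purity for simplicity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=1)
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X, y)

assert np.all(np.diff(path.ccp_alphas) >= 0)   # alphas sorted ascending
assert np.all(np.diff(path.impurities) >= 0)   # leaf impurity only grows
```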
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
clfs = []
for ccp_alpha in ccp_alphas:
clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
clf.fit(X_train, y_train) # fit decision tree on training data
clfs.append(clf)
print(
"Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
clfs[-1].tree_.node_count, ccp_alphas[-1]
)
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.056364969335601575
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
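The plots reflect a general property: a larger `ccp_alpha` prunes at least as aggressively, so node count is non-increasing in alpha, ending at the trivial single-node tree for the largest effective alpha. A sketch on toy data:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=1)
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X, y)

# Refit once per effective alpha and record the resulting tree size
node_counts = [
    DecisionTreeClassifier(random_state=1, ccp_alpha=a).fit(X, y).tree_.node_count
    for a in path.ccp_alphas
]
print(node_counts)
```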
Recall vs alpha for training and testing sets
recall_train = []
for clf in clfs:
pred_train = clf.predict(X_train)
values_train = recall_score(y_train, pred_train)
recall_train.append(values_train)
recall_test = []
for clf in clfs:
pred_test = clf.predict(X_test)
values_test = recall_score(y_test, pred_test)
recall_test.append(values_test)
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.0006414326414326415, random_state=1)
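One caveat about the selection above: picking alpha by test-set recall lets the test data influence model choice. A hedged alternative (sketched on illustrative synthetic data, not the notebook's method) is to score each candidate alpha by cross-validation on the training set instead:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, weights=[0.85], random_state=1)
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas[:-1]  # drop the alpha giving the root-only tree

# Mean cross-validated recall for each candidate alpha
cv_recall = [
    cross_val_score(
        DecisionTreeClassifier(random_state=1, ccp_alpha=a),
        X, y, cv=5, scoring="recall",
    ).mean()
    for a in alphas
]
best_alpha = alphas[int(np.argmax(cv_recall))]
print(best_alpha)
```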
Post-Pruning
estimator_2 = DecisionTreeClassifier(
ccp_alpha=0.0006414326414326415, class_weight={0: 0.15, 1: 0.85}, random_state=1
)
estimator_2.fit(X_train, y_train)
DecisionTreeClassifier(ccp_alpha=0.0006414326414326415,
                       class_weight={0: 0.15, 1: 0.85}, random_state=1)
Checking performance on training data
confusion_matrix_sklearn(
estimator_2, X_train, y_train
) # confusion matrix for training data
decision_tree_tune_post_train = model_performance_classification_sklearn(
estimator_2, X_train, y_train
)
decision_tree_tune_post_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.990857 | 1.0 | 0.911846 | 0.95389 |
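The post-pruned model also sets `class_weight={0: 0.15, 1: 0.85}`, which up-weights loan takers during training. A sketch on synthetic data (illustrative, not the bank dataset) showing that this is equivalent to multiplying each sample's weight by its class weight:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, weights=[0.9], random_state=1)

# class_weight expands to a per-sample weight of weight[class of sample]
weighted = DecisionTreeClassifier(
    class_weight={0: 0.15, 1: 0.85}, random_state=1
).fit(X, y)
manual = DecisionTreeClassifier(random_state=1).fit(
    X, y, sample_weight=np.where(y == 1, 0.85, 0.15)
)

# Same weights, same splitter, same seed -> identical trees
assert np.array_equal(weighted.predict(X), manual.predict(X))
```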
Visualizing the Decision Tree
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
estimator_2,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# add arrows to the decision tree split if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of the decision tree
print(tree.export_text(estimator_2, feature_names=feature_names, show_weights=True))
|--- Income <= 98.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [374.10, 0.00] class: 0
|   |--- CCAvg > 2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- CCAvg <= 3.95
|   |   |   |   |--- Income <= 81.50
|   |   |   |   |   |--- Age <= 36.50
|   |   |   |   |   |   |--- Education_2 <= 0.50
|   |   |   |   |   |   |   |--- weights: [0.60, 0.00] class: 0
|   |   |   |   |   |   |--- Education_2 > 0.50
|   |   |   |   |   |   |   |--- weights: [0.15, 1.70] class: 1
|   |   |   |   |   |--- Age > 36.50
|   |   |   |   |   |   |--- ZIPCode_91 <= 0.50
|   |   |   |   |   |   |   |--- weights: [6.15, 0.00] class: 0
|   |   |   |   |   |   |--- ZIPCode_91 > 0.50
|   |   |   |   |   |   |   |--- Education_3 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |   |   |--- Education_3 > 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |   |   |--- Income > 81.50
|   |   |   |   |   |--- Mortgage <= 152.00
|   |   |   |   |   |   |--- Securities_Account <= 0.50
|   |   |   |   |   |   |   |--- CCAvg <= 3.05
|   |   |   |   |   |   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |   |   |   |   |   |--- CCAvg > 3.05
|   |   |   |   |   |   |   |   |--- weights: [2.25, 9.35] class: 1
|   |   |   |   |   |   |--- Securities_Account > 0.50
|   |   |   |   |   |   |   |--- weights: [0.60, 0.00] class: 0
|   |   |   |   |   |--- Mortgage > 152.00
|   |   |   |   |   |   |--- weights: [1.05, 0.00] class: 0
|   |   |   |--- CCAvg > 3.95
|   |   |   |   |--- weights: [6.75, 0.00] class: 0
|   |   |--- CD_Account > 0.50
|   |   |   |--- weights: [0.15, 6.80] class: 1
|--- Income > 98.50
|   |--- Education_1 <= 0.50
|   |   |--- Income <= 116.50
|   |   |   |--- CCAvg <= 2.80
|   |   |   |   |--- Income <= 106.50
|   |   |   |   |   |--- weights: [5.40, 0.00] class: 0
|   |   |   |   |--- Income > 106.50
|   |   |   |   |   |--- Age <= 57.50
|   |   |   |   |   |   |--- Age <= 41.50
|   |   |   |   |   |   |   |--- Mortgage <= 51.50
|   |   |   |   |   |   |   |   |--- CCAvg <= 1.55
|   |   |   |   |   |   |   |   |   |--- weights: [1.05, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- CCAvg > 1.55
|   |   |   |   |   |   |   |   |   |--- CCAvg <= 1.75
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.70] class: 1
|   |   |   |   |   |   |   |   |   |--- CCAvg > 1.75
|   |   |   |   |   |   |   |   |   |   |--- CCAvg <= 2.40
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [1.35, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- CCAvg > 2.40
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |   |   |--- Mortgage > 51.50
|   |   |   |   |   |   |   |   |--- weights: [1.65, 0.00] class: 0
|   |   |   |   |   |   |--- Age > 41.50
|   |   |   |   |   |   |   |--- Online <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |   |   |   |   |   |--- Online > 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.60, 3.40] class: 1
|   |   |   |   |   |--- Age > 57.50
|   |   |   |   |   |   |--- weights: [1.35, 0.00] class: 0
|   |   |   |--- CCAvg > 2.80
|   |   |   |   |--- Age <= 63.50
|   |   |   |   |   |--- ZIPCode_93 <= 0.50
|   |   |   |   |   |   |--- weights: [0.90, 19.55] class: 1
|   |   |   |   |   |--- ZIPCode_93 > 0.50
|   |   |   |   |   |   |--- weights: [0.30, 0.00] class: 0
|   |   |   |   |--- Age > 63.50
|   |   |   |   |   |--- weights: [0.30, 0.00] class: 0
|   |   |--- Income > 116.50
|   |   |   |--- weights: [0.00, 188.70] class: 1
|   |--- Education_1 > 0.50
|   |   |--- Family <= 2.50
|   |   |   |--- Income <= 100.00
|   |   |   |   |--- CCAvg <= 4.20
|   |   |   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |   |   |--- CCAvg > 4.20
|   |   |   |   |   |--- weights: [0.00, 1.70] class: 1
|   |   |   |--- Income > 100.00
|   |   |   |   |--- Income <= 103.50
|   |   |   |   |   |--- Securities_Account <= 0.50
|   |   |   |   |   |   |--- weights: [2.10, 0.00] class: 0
|   |   |   |   |   |--- Securities_Account > 0.50
|   |   |   |   |   |   |--- weights: [0.15, 0.85] class: 1
|   |   |   |   |--- Income > 103.50
|   |   |   |   |   |--- weights: [64.95, 0.00] class: 0
|   |   |--- Family > 2.50
|   |   |   |--- Income <= 108.50
|   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |--- weights: [1.05, 0.00] class: 0
|   |   |   |   |--- Family > 3.50
|   |   |   |   |   |--- weights: [0.15, 0.85] class: 1
|   |   |   |--- Income > 108.50
|   |   |   |   |--- weights: [0.45, 45.05] class: 1
## Gini importance - importance of a feature computed as (normalized) total reduction of the criterion brought by that feature
print(
pd.DataFrame(
estimator_2.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
| | Imp |
|---|---|
| Income | 0.601871 |
| Family | 0.143908 |
| Education_1 | 0.126242 |
| CCAvg | 0.087741 |
| CD_Account | 0.011265 |
| Age | 0.009331 |
| Mortgage | 0.004972 |
| Securities_Account | 0.004830 |
| ZIPCode_91 | 0.002659 |
| Education_2 | 0.002217 |
| Education_3 | 0.001705 |
| Online | 0.001693 |
| ZIPCode_93 | 0.001566 |
| ZIPCode_92 | 0.000000 |
| ZIPCode_94 | 0.000000 |
| ZIPCode_95 | 0.000000 |
| ZIPCode_96 | 0.000000 |
| CreditCard | 0.000000 |
| ZIPCode_90 | 0.000000 |
importances = estimator_2.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="maroon", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Checking performance on test data
confusion_matrix_sklearn(estimator_2, X_test, y_test) # confusion matrix on test data
# model performance on test data
decision_tree_tune_post_test = model_performance_classification_sklearn(
estimator_2, X_test, y_test
)
decision_tree_tune_post_test
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.979333 | 0.912752 | 0.883117 | 0.89769 |
# training performance comparison
models_train_comp_df = pd.concat(
[
decision_tree_perf_train.T,
decision_tree_tune_perf_train.T,
decision_tree_tune_post_train.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Decision Tree sklearn",
"Decision Tree (Pre-Pruning)",
"Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Decision Tree sklearn | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) |
|---|---|---|---|
| Accuracy | 1.0 | 0.990286 | 0.990857 |
| Recall | 1.0 | 0.927492 | 1.000000 |
| Precision | 1.0 | 0.968454 | 0.911846 |
| F1 | 1.0 | 0.947531 | 0.953890 |
# testing performance comparison
models_test_comp_df = pd.concat(
[
decision_tree_perf_test.T,
decision_tree_tune_perf_test.T,
decision_tree_tune_post_test.T,
],
axis=1,
)
models_test_comp_df.columns = [
"Decision Tree sklearn",
"Decision Tree (Pre-Pruning)",
"Decision Tree (Post-Pruning)",
]
print("Testing performance comparison:")
models_test_comp_df
Testing performance comparison:
| | Decision Tree sklearn | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) |
|---|---|---|---|
| Accuracy | 0.979333 | 0.980000 | 0.979333 |
| Recall | 0.892617 | 0.865772 | 0.912752 |
| Precision | 0.898649 | 0.928058 | 0.883117 |
| F1 | 0.895623 | 0.895833 | 0.897690 |